Table 17-1 shows theoretical coding for a data set containing the variables StudyID (for participant
ID) and PrimaryDx (for participant primary diagnosis). As shown in Table 17-1, you take each level
and make an indicator variable for it: hypertension is HTN, diabetes is Diab, cancer is Cancer, and
other is OtherDx. Instead of including the variable PrimaryDx in the model, you’d include the
indicator variables for all levels of PrimaryDx except the reference level. So, if the reference level
you selected for PrimaryDx was hypertension, you’d include Diab, Cancer, and OtherDx in the
regression, but would not include HTN. In contrast to the education example, participants in
Table 17-1 can have a 1 for one or more indicator variables, or just be in the reference group.
In the education example, each participant can be coded at only one level, or be in
the reference group.
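Don’t forget to leave the reference-level indicator variable out of the regression, or your
model will break! When each participant falls into exactly one level, the indicators for all
levels add up to 1 for everyone, which duplicates the intercept, so the software can’t estimate
a unique coefficient for each one.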
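The indicator coding described above can be sketched in a few lines of Python. This is a minimal illustration, not the book's own procedure: the StudyID values and diagnoses are made up, the names INDICATOR_NAMES, REFERENCE, and add_indicators are hypothetical, and the sketch assumes each participant has a single PrimaryDx value.

```python
# Map each level of PrimaryDx to its indicator variable name (names from Table 17-1).
INDICATOR_NAMES = {
    "hypertension": "HTN",
    "diabetes": "Diab",
    "cancer": "Cancer",
    "other": "OtherDx",
}
REFERENCE = "hypertension"  # the reference level chosen in the text

# Hypothetical participants, made up for illustration only.
records = [
    {"StudyID": 101, "PrimaryDx": "hypertension"},
    {"StudyID": 102, "PrimaryDx": "diabetes"},
    {"StudyID": 103, "PrimaryDx": "cancer"},
    {"StudyID": 104, "PrimaryDx": "other"},
]


def add_indicators(record):
    """Return a copy of the record with one 0/1 indicator per level."""
    coded = dict(record)
    for level, name in INDICATOR_NAMES.items():
        coded[name] = 1 if record["PrimaryDx"] == level else 0
    return coded


coded = [add_indicators(r) for r in records]

# Only the non-reference indicators go into the regression.
model_columns = [
    name for level, name in INDICATOR_NAMES.items() if level != REFERENCE
]

print(model_columns)
for row in coded:
    print(row["StudyID"], [row[name] for name in model_columns])
```

In this sketch, a participant whose primary diagnosis is hypertension has a 0 for every column in model_columns, which is what it means to be in the reference group.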
Creating scatter charts before you jump into multiple regression analysis
One common mistake researchers make is immediately running a regression or another advanced
statistical analysis before thoroughly examining their data. As soon as your data are available in
electronic format, you should run error checks and generate summaries and histograms for each
variable you plan to use in your regression. You need to assess the way the values of the variables are
distributed, as we describe in Chapter 11. And if you plan to analyze your data using multiple
regression, you need special preparation. Namely, you should chart the relationship between each
predictor variable and the outcome variable, and also the relationships between the predictor
variables themselves.
Imagine that you are interested in whether the outcome of systolic blood pressure (SBP) can be
predicted by age, body weight, or both. Table 17-2 shows a small data file, used throughout the
remainder of this chapter, with variables that could address this research question. It contains the age,
weight, and SBP of 16 study participants from a clinical population.
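For the age, weight, and SBP example, the charts recommended above (each predictor against the outcome, plus the predictors against each other) can be sketched as follows. This is only one possible approach: the values are those of the first nine participants listed in Table 17-2, the names data, OUTCOME, pairs, and plot_pairs are hypothetical, and the plotting function assumes the third-party matplotlib package is installed.

```python
from itertools import combinations

# First nine participants listed in Table 17-2.
data = {
    "Age": [60, 61, 74, 57, 63, 68, 66, 77, 63],
    "Weight": [58, 90, 96, 72, 62, 79, 69, 96, 96],
    "SBP": [117, 120, 145, 129, 132, 130, 110, 163, 136],
}
OUTCOME = "SBP"
predictors = [name for name in data if name != OUTCOME]

# Each predictor against the outcome, then each pair of predictors.
pairs = [(p, OUTCOME) for p in predictors] + list(combinations(predictors, 2))
print(pairs)


def plot_pairs(data, pairs):
    """Draw one scatter chart per (x, y) pair, side by side."""
    import matplotlib.pyplot as plt  # third-party; imported only when plotting

    fig, axes = plt.subplots(
        1, len(pairs), figsize=(4 * len(pairs), 4), squeeze=False
    )
    for ax, (x, y) in zip(axes[0], pairs):
        ax.scatter(data[x], data[y])
        ax.set_xlabel(x)
        ax.set_ylabel(y)
    fig.tight_layout()
    plt.show()
```

Calling plot_pairs(data, pairs) would draw three panels: SBP against age, SBP against weight, and weight against age.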
TABLE 17-2 Sample Age, Weight, and Systolic Blood Pressure Data for a Multiple Regression Analysis
Participant ID   Age (years)   Weight (kg)   SBP (mmHg)
1                60            58            117
2                61            90            120
3                74            96            145
4                57            72            129
5                63            62            132
6                68            79            130
7                66            69            110
8                77            96            163
9                63            96            136